CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes

Deconvolution methods infer quantitative cell type estimates from bulk measurement of mixed samples including blood and tissue. DNA methylation sequencing measures multiple CpGs per read, but few existing deconvolution methods leverage this within-read information. We develop CelFiE-ISH, which extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH outperforms CelFiE and other existing methods, achieving 30% better accuracy and more sensitive detection of rare cell types. We also demonstrate the importance of marker selection and of tailoring markers for haplotype-aware methods. While here we use gold-standard short-read sequencing data, haplotype-aware methods will be well-suited for long-read sequencing. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03275-x.

(1) The observed data log-likelihood is:

Q function
As z in unknown, we define p as the probability of z: P (z t,c = 1|α, β) =: pt,c Q is the expected value of the log-likelihood function.
At iteration i, the Q-function is:

E-step
In the E-step we estimate the latent variable z and use it to define the Q function.

M-step
In the M-step we maximize the Q function, holding the estimate for the latent variable z constant and maximizing α.

Reference Atlas
The reference atlas consists of two matrices, Y t,m and D Y t,m , with the number of methylated and total reads for cell type t at position m respectively.We assume Y t,m is drawn from a Binomial distribution with β t,m being the true methylation probability and D Y t,m being the number of trials.We re-estimate the atlas at each iteration.

Mixture
The mixture is one matrix X, with dimensions C reads over M CpG sites.

Likelihood
The observed data likelihood is: The observed data log-likelihood is: The complete data likelihood is: The first term is The second term is The third term is

Q function
As z in unknown, we define p as the probability of z: Q is the expected value of the log-likelihood function.
At iteration i, the Q-function is:

E-step
In the E-step we estimate the latent variable z and use it to define the Q function.

M-step
In the M-step we maximize the Q function, holding the estimate for the latent variable z constant and maximizing α.
Next, we re-estimate the atlas: The Epistate Model At every marker region, reads are drawn from one of two possible epistates: θ high and θ low .Each epistate consists of a set of binomial distributions θ = {θ 1 , θ 2 , ..., θ m }, one per CpG site covered by the marker region.θ high is arbitrarily defined to be the epistate with higher mean methylation.Cell types differ by the probability of observing each epistate in each region.

Reference Atlas
The reference atlas consists of one matrix λ t,c , with the probability of observing θ high under cell type t at read c.Within a genomic region λ does not vary between reads, leaving λ t .Additionally, for every position we know θ high,m and θ low,m (see below).The overall probability of methylation per position is: 14 Mixture The mixture is one matrix X, with dimensions C reads over M CpG sites.

Likelihood
The observed data likelihood is: The observed data log-likelihood is: z is the indicator for α and µ is the indicator for λ.The complete data likelihood is: P (x, z, µ|α, θ high , θ low , λ) = P (x|µ, θ high , θ low )P (z|α)P (µ|z, λ) The first term is The second term is The third term is

Q function
As z in unknown, we define p as the posterior probability of z: Similarly, P (µ c = 1|z, x) =: qc Note that λ, θ high , θ low and by extension β are always given and not reestimated.For simplicity, we left them out of the conditional statements.Q is the expected value of the log-likelihood function.
At iteration i, the Q-function is:

E-step
In the E-step we estimate the latent variables z and µ and use them to define the Q function.
Since µ can only take on two values, we constrain As above: Finally: We do the same for z: Then normalize so that every read comes from a cell type.

M-step
In the M-step we maximize the Q function, holding the estimate for the latent variables constant and maximizing α.The only term in the Q function with α is identical to CelFiE and CelFiE+, so the maximization step is the same.
Estimating Epistates in the Reference Atlas For each marker region in the Epistate reference, we estimate Θ high , Θ low and λ t .First, we jointly examine all reads from the entire reference dataset.We assume each read is associated with either Θ high or Θ low .υ j is the prior probability for epistate j ∈ [1, 2].At the expectation step, we update the posterior probability of each read P j,c given Θ.At the maximization step, we estimate the hidden state Θ, and υ j .

Likelihood
The observed data likelihood is: Maximization Then, we split the reference by cell type.For each cell type, λ if the probability of observing Θ high .For each subset: . ., Y n ] be a vector of true cell type fractions in a mixture, ordered from smallest to largest Y 1 ≤ Y 2 ≤ . . .≤ Y n and Ŷ = [ Ŷ1 , Ŷ2 , . . ., Ŷn ] be the estimated values.The RMSE is defined as As these are fractions we can add the constraint that n i=1 Y i = 1 and 0 ≤ Y i ≤ 1 for all i.This is also true for the estimates: n i=1 Ŷi = 1 and 0 ≤ Ŷi ≤ 1 for all i.
For the worst-case estimation, i.e. the largest RMSE, let Ŷ1 = 1 and Ŷi = 0 for i ̸ = 1.The squared error terms are then (1 − Y 1 ) 2 for i = 1 and Y 2 i for i ̸ = 1.To prove this results in the maximum RMSE, consider any other estimate Ŷ ′ .This implies, for some j ̸ = 1, Ŷ ′ j > 0. The squared error term would then be (1 The entire expression is therefore smaller than the worst-case estimation.Intuitively, since Y 1 is the smallest, its error term has the largest impact on increasing the RMSE when estimated far from its true value.Thus, any other estimation would result in a lower RMSE.

WGBS Data Processing
In order to convert BAM files to the Biscuit epiread format, we first generated a SNP file from the VCF files requiring GQ ≥ 15 for positions overlapping a dbSNP common allele, and requiring GQ ≥ 60 for all other positions.DbSNP common allele table was downloaded from UCSC for the hg19 assembly, and was processed with: https://github.com/ekushele/methylseq/blob/master/bin/processUcscDbsnp.pl.